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Extreme Self-Organization in Networks Constructed from Gene Expression Data 
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We study networks constructed from gene expression data obtained from many types of cancers. The 
networks are constructed by connecting vertices that belong to each others' list of A'-nearest-neighbors, 
with K being an a priori selected non-negative integer. We introduce an order parameter for characterizing 
the homogeneity of the networks. On minimizing the order parameter with respect to K, degree distribu- 
tion of the networks shows power-law behavior in the tails with an exponent of unity. Analysis of the 
eigenvalue spectrum of the networks confirms the presence of the power-law and small-world behavior. 
We discuss the significance of these findings in the context of evolutionary biological processes. 

PACS numbers: 89.75.-k, 05.65.+b, 87.23.Kg, 87.18.Sn 



Recent technical advancements have led to widespread 
use of gene chips for quantizing and monitoring the expres- 
sion level of thousands of genes in parallel Presently, 
gene expression profiling has become an important tool for 
diagnosis and classification of diseases. It is being used 
extensively for identifying genes responsible for specific 
conditions, e.g., various cancers [||, ||, |4[ ||, ^J. This is 
done using specialized clustering techniques developed in 
recent years [^, 0, ||]. Gene expression data can also be 
used for constructing networks of coexpressed and coregu- 
lated genes. Since proteins are the end product of gene ex- 
pression, various types of protein networks [ |lc| ] and gene 
networks are directly related. Consequently, networks of 
coregulated genes are expected to play key role in biolog- 
ical processes. In this letter we outline results that show 
the relevance of these networks in evolutionary biological 
processes. 

The volume of gene expression data obtained from typi- 
cal experiments is enormous and contains information on 
expression of all the genes (presently almost 10000 or 
more) marked on the chip. In any given condition most 
of the genes are not important and do not express. As a re- 
sult, before the expression data can be used for construct- 
ing networks, it requires extensive processing and filter- 
ing to eliminate uninformative genes. We skip these de- 
tails here as they can be found with the source of the data 
[§, g @, §, g] and elsewhere |§ [|. Henceforth, we as- 
sume that expression data for the selected set of informa- 
tive genes is available in the form of a matrix having TV 
rows and D columns. The rows represent the genes and the 
columns represent the samples/tissues. Furthermore, the 
expression values of each of the genes in this matrix are 
normalized to have a mean of zero and variance of unity 
across the samples. 

The normalized expression levels are treated as coordi- 
nates of N genes, that form the vertices of the networks, in 
D dimensional space of samples. The network construction 
algorithm requires specification of the maximum number 
of neighbors K, < K < N, that a vertex can have. For a 
given K, gene network is constructed using the following 
two step procedure, (i) For each vertex i, i = 1, . . . , N, 



make a list L, of its first K nearest neighbors, (ii) Connect 
all vertices i and j through an edge if i £ Lj and j G L,, 
otherwise the vertices are not connected. This algorithm 
is derived from the /^-nearest-neighbor parameter estima- 
tion method [O], With some heuristic modifications it has 
been used earlier in clustering analysis of various types of 
data We used Euclidean norm as the distance measure 
for making the list of K nearest neighbors. Other distance 
measures can also be used. The results presented herein 
remain unaltered as long as the distance measure preserves 
the ordering of points obtained from the Euclidean mea- 
sure. 

For a given data set, the topological structure of networks 
generated by this algorithm depends strongly on the param- 
eter K. For K = 0, the network consists of TV isolated ver- 
tices, and for K = N — 1 each vertex is connected to all the 
other vertices. For most of the values of K, these networks 
have more than one connected component. As K increases, 
the connectivity of each vertex grows depending on its lo- 
cal environment. Vertices that lie close to each other tend to 
get mutually connected in preference to those lying farther 
away. Thus, all connected components in these networks 
have a small-world structure. Furthermore, for K > 3 the 
networks have a giant connected component. 

We analyzed networks constructed using several pub- 
lished gene expression data sets. To ensure that our re- 
sults are not affected by possible bias of technology used 
for manufacturing the gene chips, we used expression data 
sets obtained using both oligonucleotide arrays [Q, |3| |j] as 
well as cDNA microarrays [^|, 01. For each network we 
calculated P{z) the probability of finding a vertex of de- 
gree z. P{z) is normalized by N so that L,-P(z) = 1. Since 
these networks are small, P(z) is very noisy. As a result, 
we calculated the cumulative probability distribution 



(1) 



where z max is the maximum degree in the network. 

We also define a quantity c = z + 1 ■ This quantity gives 
the size of the smallest cluster around a vertex with con- 
nectivity z that includes the vertex and its neighbors. It can 
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also be considered as the size of a "droplet" that is formed 
by the vertex and its neighbors. Since, the smallest and the 
largest values of z are and z max , the corresponding val- 
ues for c become 1 and c max = 1 + z m a X , where c max is the 
largest droplet size. The probability density P(c) and cu- 
mulative distribution F(c) corresponding to c are defined 
similarly to those for z. The linear relationship of c and 
Z implies that P{c) = P(z) and F{c) = F(z). Outside the 
range of z and c, corresponding probability density func- 
tions are defined to be zero. Thus, F(c)| c <i = F(z)\ z <o = 1 
mdF(c)\ c>Cim =F(z)\ z>Zim =0. 

The homogeneity of the networks can be characterized 
by a single order parameter. A suitable candidate for this is 
the area A enclosed by the F(c) versus c* = c/c max curve 
between c* = and 1. It takes values in the range to 
1 depending on the homogeneity of the network. Since 
Cmax is finite and the values of c are evenly spaced, A(K) is 
easily calculated using Trapezoidal rule and equals 

c raax A(tf) = \ [1 - J P(z max )] + £(i + l)P(i). (2) 

The terms with factor of 1 /2 represent the area of strips at 
the boundary. Their contribution vanishes as the number of 
strips increases. It is zero if P(z) is a delta function. 

From Eq. (||) the value of c max A(K) is easily identified 
as the mean size c = 1 + z of the droplets, where z is the av- 
erage degree of vertices in the network. Thus, A(K) is the 
average droplet size normalized with the size of the largest 
droplet in the network. It measures separation between the 
mean and maximum droplet sizes in the network and func- 
tions as an indicator of the the overall behavior of F(z) and 
P{z)- Very small values of A(K) imply that F(z) descends 
sharply from unity to almost zero at a small z <C Zmax and 
becomes nearly flat with plateau stretching until z m a X - The 
corresponding P(z) has long tails with weight centered at 
small z- A(K) close to unity implies that F(z) stays nearly 
flat at unity for most of z < Zma X and descends sharply to 
zero at some z ~ z m a X - In this case P(z) is sharply peaked 
similar to a delta function near z max - Intermediate values 
of A(K) imply a decaying F(z) corresponding to various 
forms of P(z) including those that rise slowly to a peak and 
then decay slowly via sharply truncated power-law tails. 

Figure |] (Alon99) shows the behavior of A(K) for net- 
works constructed using colon cancer data of Alon et al. 
[0]. The figure shows, that as K is increased from 1 to 
N — I, A(K) initially decreases and attains a minimum at 
some value of K = K\ (here K\ k, 16). The minimum is 
nearly flat and persists until K — Kj (here K% ~ 24). As K is 
increased beyond K2, A{K) continues to increase, reaching 
its maximum value of 1 at K = N — 1 . Fig. [T] (Pero99-26) 
shows similar behavior of A(K) in networks constructed 
using breast cancer data of Perou et al. [|J with a second 
minimum at K w 3N/4. The second minimum is usually 
a finite size effect. It can also occur if there is high in- 
homogeneity on length scales large compared to the near- 
est neighbors scale. It is very shallow compared to that 
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FIG. 1: Variation of A(K) with K for networks constructed from 
colon cancer data of Alon et al ^ and breast cancer data of Perou 
et al The insets show blow up of a small zone around the 
minima. The keys are as in [f24]]. 

between K\ and K2 because the probability density func- 
tion corresponding to the size of "droplets" on larger length 
scales are sharper compared to P(z) and also have smaller 
tails. The higher order minima, however, are not relevant. 
Behavior similar to that seen in Fig. [j] was observed in net- 
works from several other gene expression data sets also 

[||@. 

The behavior of A(A') divides the networks in three 
classes, (i) The networks corresponding to 1 < K < K\ 
have few connections between vertices but have high ho- 
mogeneity, (ii) The networks for K\ < K < K2 are some- 
what better connected and highly inhomogeneous. (iii) In 
the networks for Kj < K < N — 1 , the vertices have many 
connections and high homogeneity that increases with K. 

Figure [| shows the variation of the observed cumula- 
tive probability distribution F{z) with the normalized de- 
gree (z + 1 ) / (zma X + 1 ) in a wide range of values of K for 
networks constructed from many gene expression data sets 
[|| |3j |], ||, |p. It is clear from the figure that all the curves 
in the range A4 < K < K2 (solid circles) show a very good 
collapse, and thus good scaling, for all the data sets. This 
range of K houses the minimum of the order parameter 
and the corresponding networks are highly inhomogeneous 
with loosely connected vertices. The solid straight line 
passing through the tails of these curves is a least-square 
fit of the form 

F{z) = a-b\n(^-\ (3) 

\Zma X +l/ 

where a and b are the fit parameters. For the graphs shown 
in the figure, these parameters vary in the range 0.03 < a < 
0.08 and 0.47 < b < 0.71 over all the data sets. 
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FIG. 2: Variation of cumulative probability distribution function F (z) with normalized degree (z + 1 ) / (z max + 1 ) in networks constructed 
from several data sets. The keys are as in pij ]. In the graphs, data for different types of networks are plotted using different line styles. 
Solid circles are used in the range of K corresponding to minima of A(K). The straight solid line is least-square fit of the form given 
in Eq. (^) in the tails of F(z). The curves drawn with dashed lines approach the solid circles as K increases (here, in steps of 2). The 
curves drawn with dotted lines go away from the solid circles as K increases (here, in steps of 50). 



The extremely good fit of Eq. (^) in the tails of F(z) 
seen in Fig. ^ implies that the corresponding probability 
density functions have a scale-free behavior of the form 
P(z) ~ b(z + in the tails. The power law is seen in 
the range 0.35 < (z+ l)/(z max + 1) < 1, i.e., in nearly 60% 
to 65% of the range of variation of vertex degree. As this 
range is small for observing heavy tailed distributions, we 
analyze the eigenvalue spectrum of the adjacency matrix 
of networks [|l2|] to have another evidence of the scale-free 
character of the networks. 

The spectral density p (A,) of the eigenvalue spectrum of 
the adjacency matrix of networks 
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where A, ; is the y'th eigenvalue of the adjacency matrix, is a 
good indicator of the overall behavior of their degree distri- 
bution P(z) and topological structure. For random graphs 
having a giant connected component p(A,) is known to con- 
verge to a semicircle following the Wigner's law [||]. De- 
viations from Wigner's law are seen for other cases. For 
the small- world networks p(k) shows a complex highly 
skewed structure with several blurred peaks [O]. For scale 
free networks, the spectral density has a triangular shape 
with the central part lying above the semicircle [O], 

Figure [3] shows the spectral density of networks con- 
structed from colon cancer data of Alon et al. ^ for dif- 
ferent values of K corresponding to the three zones of be- 
havior of A(K) seen previously. The figure shows that for 
small K the spectral density has an irregular shape with 



several blurred peaks and its bulk is confined below the 
semicircle. This is a characteristic of small-world networks 
[|I2|]. As K is increased, the bulk portion starts assum- 
ing a triangular shape which is a little skewed for small 
K. The top portion of the triangle starts protruding above 
the semicircle as K crosses K\ . This is clear from the in- 
termediate and high values of K in the figure. The trian- 
gular shape persists till K becomes almost equal \o N —1. 
At K = N — 1 , the spectral density develops a delta func- 
tion peak corresponding \o N — \ repeated eigenvalues at 
X = — 1 and the largest eigenvalue is Xi = N — 1 . This 
behavior of the spectral density was seen in all the other 
data sets also [^, |[ |5| ^|. It confirms that these gene net- 
works have small-world character and become scale-free 
for K>K X . 

Presence of scale-free behavior indicates high degree of 
self-organization in the system and is known to be a charac- 
teristic of natural systems [O]. It has been observed in sev- 
eral natural and artificial networks, e.g., power grid [14], 
the Internet [15], the World Wide We b rtlol ], actor network 
[0, web of human sexual contacts [|l8[], citation and col- 
laboration networks Jl9| ], conceptual network of language 
[pot], metabolic network [^I|], food web [jffi], and protein 
networks [|l(^]. The exponents of the power laws observed 
earlier, however, were always more than unity. The gene 
networks studied here are first examples of biological net- 
works showing scale-free behavior with exponent of unity. 

The scale-free character of coexpressed gene networks 
means that these networks are extremely inhomogeneous 
and contain few genes that are very highly connected and a 
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FIG. 3: Spectral density for networks constructed from colon cancer data of Alon et al. ||2|] for different values of K corresponding 
to the three zones of behavior of the order parameter A(K) (see Fig. [j]). p is the fraction of edges, out of the maximum possible 
N(N — l)/2, present in the network. Semicircle corresponding to spectral density of random networks is drawn for comparison. 



large number of genes with low connectivity. This implies 
that these networks contain large groups of coexpressed 
genes. As a result, the present study conclusively shows, 
using direct experimental data, that various types of can- 
cers are a consequence of malfunction of only a few genes 
that either regulate the expression of a large number of 
other genes or form the hubs of various crowded gene reg- 
ulatory pathways functioning in the organism. It is known 
that disturbing such genes could be fatal for the organism 
[p3|], which also turns out to be the mechanism of origin 
of cancers. Identification of such genes and understanding 
their functionality under various conditions through which 
an organism can pass in its lifetime is directly relevant in 
the design of highly targeted drugs, among other possibili- 
ties. 

The simultaneous presence of small-world and scale-free 
characters in these networks seems to be a perfect fit in the 
evolutionary scheme of biological systems. The high ro- 
bustness displayed by biological systems is a consequence 
of the scale-free character of the associated networks. On 
the other hand, the fast reaction and rapid adaptability 
shown by biological systems can come only if the asso- 
ciated networks have a small-world character. This fits the 
structure of biological signaling system well because the 
chemical signaling employed at most places in biological 
systems, by its very nature, is very slow compared to, e.g., 
electrical signaling in neurons. Thus, for achieving fast 
message transmission, the associated networks must evolve 
to have small-world character. 
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